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1 Abstract 

This paper deals with the speed of convergence of the learning curve in a Gaussian 
process regression framework. The learning curve describes the average generaliza- 
tion error of the Gaussian process used for the regression. More specifically, it is 
defined in this paper as the integral of the mean squared error over the input pa- 
rameter space with respect to the probability measure of the input parameters. The 
main result is the proof of a theorem giving the mean squared error in function of 
the number of observations for a large class of kernels and for any dimension when 
the number of observations is large. From this result, we can deduce the asymp- 
totic behavior of the generalization error. The presented proof generalizes previous 
ones that were limited to more specific kernels or to small dimensions (one or two). 
The result can be used to build an optimal strategy for resources allocation. This 
strategy is applied successfully to a nuclear safety problem. 



1 



Keywords: Gaussian process regression, asymptotic mean squared error, learning 
curves, generalization error, convergence rate. 



2 Introduction 

Gaussian process regression is a useful tool to approximate an objective function 
given some of its observations [Laslett, 1994| . Initially used in geostatistics to inter- 
polate a random field at unobserved locations [Wackernagel, 2003 , [Berger et al., 200T 



and [Gneiting et al., 2010 , it has been developed in many areas such that environ- 



mental and atmospheric sciences, meteorology and mining. 

This method has become very popular during the last decades especially to 
build surrogate models from noise-free observations. For example, it is widely used 
in the field of "computer experiments" to build models which surrogate an expensive 
computer code Sacks et al., 1989] . Then, through the fast approximation of the 



computer code, uncertainty quantification and sensitivity analysis can be performed 
with a low computational cost. 

Nonetheless, for many realistic cases, we do not have direct access to the function 
to be approximated but only to noisy versions of it. For example, if the objective 
function is the result of an experiment, the available responses - also called outputs 
- could be tainted by measurement noise. In that case, we could reduce the noise 
of the observations by repeating the experiments at the same locations. Another 
example is Monte-Carlo based simulators - also called stochastic simulators - which 
use Monte-Carlo or Monte-Carlo Markov Chain methods to solve a system of dif- 
ferential equations through its probabilistic interpretation. For such simulators, the 
level of the noise can be tuned by the number of Monte-Carlo particles used in the 
procedure. 

Gaussian process regression can be easily adapted to the case of noisy observa- 
tions. Furthermore, as seen in the previous paragraph, in many cases the amount 
of noise can be reduced under the condition of increasing cost in time. Therefore, if 
the budget is given, a trade off between the number and the accuracy of the obser- 
vations has to be made. For example, if the total budget is doubled, then we can 
decide either to repeat all the experiments at the same location or to carry out new 
experiments at new locations. For practical applications, a fundamental problem is 
hence to estimate the budget needed to reach an objective generalization error by 
taking into account the dependence between the observation noise variance and the 
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number of observations. This is also of theoretical interest since it deals with the 
size training sample dependence of the generalization error. 

Many authors were interested in obtaining learning curves describing the gener- 
alization error as a function of the training set size [Rasmussen and Williams, 2006 



The problem has been addressed in the statistical and numerical analysis areas. For 
an overview, the reader is referred to Ritter, 2000b| for a numerical analysis point of 
view and to [Rasmussen and Williams, 2006 for a statistical one. In particular, in 
the numerical analysis literature, the authors are interested in numerical differentia- 
tion of functions from noisy data (see [Ritter, 2000a and [Bozzini and Rossini, 2003| ). 
They have found very interesting results for kernels satisfying the Sacks- Ylvisaker 
conditions of order r [Sacks and Ylvisaker, 198T but only valid for 1-D or 2-D func- 
tions. 

In the statistical literature [Sollich and Halees, 2002 give accurate approxima- 
tions to the learning curve and [Opper and Vivarelli, 1999 and Williams and Vivarelli, 2000 



give upper and lower bounds on it. Their approximations give the asymptotic value 
of the learning curve. They are based on the Woodbury-Sherman-Morison matrix 
inversion lemma [Harville, 1997 which holds in finite-dimensional cases which cor- 
respond to degenerate kernels in our context. Nonetheless, classical kernels used 
in Gaussian process regression are non-degenerate and we hence are in an infinite- 
dimensional case and the Woodbury-Sherman-Morison formula cannot be used di- 
rectly. Another proof for degenerate kernels can be found in [Picheny, 2009 . 



The main result of this paper is a theorem giving the asymptotic value of the 
Gaussian process regression mean squared error for a large training set size when the 
noise observation variance depends on the number of observations. This asymptotic 
value is given as a function of the eigenvalues and eigenf unctions of the Karhunen- 
Loeve decomposition of the covariance kernel. From this theorem, we can deduce 
an approximation of the learning curve for non-degenerate and degenerate kernels 
(which generalizes results in [Opper and Vivarelli, 199"9 , [Sollich and Halees, 2002 



and [Picheny, 2009| ) and for any dimension (which generalizes results in Ritter, 2000b 



[Ritter, 2000a| and (B ozzini and Rossini, 2003| ). Finally, from this approximation 

we can deduce the rate of convergence of the best linear unbiased predictor (BLUP) 
in a Gaussian process regression framework. 

The rate of convergence of the BLUP is of practical interest since it provides 
a powerful tool for decision support. Indeed, from an initial experimental design 
set and if the number of observations is large enough, it can provide the amount 
of budget that we must add to reach a given desired accuracy. Nevertheless, the 
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theorem holds for a constant observation noise variance. Considering this variance 
as constant is often a reasonable assumption to provide a good estimation of the 
needed budget but to determine the best resource allocation it is worth taking into 
account the noise heterogeneity. We propose in this paper a theorem giving such a 
resource allocation. 

Toy examples are presented in this paper to compare the theoretical convergences 
given by the theorem and numerically observed convergences. Then, an industrial 
application to the safety assessment of a nuclear system containing fissile materials 
is performed. This real case emphasizes the effectiveness of the theoretical rate of 
convergence of the BLUP since it predicts a very good approximation of the budget 
needed to reach a prescribed precision. 

3 Gaussian process regression 

Let us suppose that we want to approximate an objective function x G M'^ — t- /(x) G 
M from noisy observations of it at points (xi)j=i,...,„s with Xi G Mf^. The points of the 
experimental design set (xj)i=i^,,.^„s are supposed to be sampled from the probability 
measure fi. We hence have ns observations of the form Zi = f{xi) + and 
we consider that (ei(xj))j=i^,,,^„s are independently and identically sampled from the 
Gaussian distribution with mean zero and variance nT{x): 

€{x) ^^{o,nT{x)) (1) 

Note that the number of observations and the observation noise variance are both 
controlled by n. It means that if we increase the number of observations according 
to n, we automatically increase the uncertainty on the observations. Furthermore, 
if we increase the number of observations according to s, the noise variance does not 
vary. We so have an interesting flexibility in our model which can be used for many 
real cases. 

Example 1 For an experiment with ri independent replications at points (xj)j=i^...^„s, 
we have responses of the form: 

1 

k=l 

where {Yk{x))k=i,...,r o-f^ identically distributed independent Gaussian variables with 
mean f{x) (the quantity of interest) and variance cr'^{x). Therefore, the variance 
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of Z[r^){xi) equals '^"^ and we have observations of the form = /(xj) + 

ei{xi) with ei{xi) ~ M {fd^rl^a1{xi)) . Let us suppose that we have a fixed budget 
T uniformly spread on the experimental design set - i.e. Ti = r \/i = 1, . . . ,ns - 
and equal to the number of experiments ns times the number of replications r - i.e. 
T = nsr. We so have: 

var(z(^)(x)) = Y^l{x) = nr(x) 

where r(x) = |;(T^(x) is a constant independent of n. In this framework the noise 
is sampled from the Gaussian distribution f/{0,nT{x)) . This is a typical example in 
which, when the total budget is fixed, the noise variance and the number of points 
are controlled by the same parameter n. 

The main idea of the Gaussian process regression is to suppose that the objective 
function f{x) is the reahzation of a Gaussian process with a known mean and a 
known covariance kernel k{x, x'). The mean can be considered equal to zero without 
loss of generality. Then, denoting by z"* = [/(xj) + Si{xi)]i<i<ns the vector of 
length ns containing the noisy observations, we choose as predictor the Best Linear 
Unbiased Predictor (BLUP) given by the equation: 

/(x) = k{xf{K + n^y\^% A = diag[(r(x,)),=i„„,„,] (2) 

where k{x) = [k{x,Xi)]i<i<ns is the ns-vector containing the covariances between 
Z{x) and Z{xi), 1 < i < ns and K = [k{xi,Xj)]i<ij<ns is the ns x ns-matrix 
containing the covariances between Z{xi) and Z{xj), I < i,j < ns. When r(x) is 
independent of x, we have A = rJ with / is the ns x ns identity matrix. The BLUP 
minimizes the Mean Squared Error (MSE) which equals: 

a\x) = k{x, x) - k{xf{K + nA)-^k{x) (3) 

Indeed, if we consider a Linear Unbiased Predictor (LUP) of the form a{x)'^z"'^, 
its MSE is given by: 

E[{Z{x) - a^{x)Z'''y] = k{x, x) - 2a{xfk{x) + a{xf{K + nA)a(a;) (4) 

where Z'"^ = [Z{xi) + Si{xi)]i<i<ns- The value of a(x) minimizing (jl]) is aopt(a;)"^ = 
k{xf{K + nAy\ Therefore, the BLUP given by Oopt (3^)^-2"'* is equal to ([2]) and 
by substituting a{x) with aopt(a;) in equation (jl]) we obtain the MSE of the BLUP 
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given by equation ([3]). 



The main result of this paper is the proof of a theorem providing the asymptotic 
value of a^{x) when n — )■ +00 and A = rJ. Thanks to this theorem, we can deduce 
the asymptotic value of the Integrating Mean Squared Error (IMSE) - also called 
learning curve or generalization error - when n — )■ +00. The IMSE is defined by: 

IMSE= [ a\x)dfx{x) (5) 

where /i is the measure of the input space parameters. The asymptotic value of 
the IMSE that we obtain can be viewed as a generalization of previous results (see 
IRasmussen and Williams, 2006| , [Ritter, 2000bl , [Ritter, 2000a| , [Bozzini and Rossini, 2003 



|Opper and Vivarelli, 1999 



[Sollich and Halees, 2002| and [Picheny, 2009| ). From it the convergence rate of the 
IMSE according to s (or equivalently T) is obtained. The speed of convergence of 
the learning curve can be used to determine the budget required to reach a pre- 
scribed accuracy. Note that the proof of the theorem holds for a constant noise 
observation variance r (which corresponds to an uniform allocation of the budget 
T). Nevertheless, to provide optimal resource allocation, it can be important to 
take into account the heterogeneity of the noise observation variance. We give in 
this paper under certain restricted conditions (i.e., when K is diagonal) the optimal 
allocation taking into account the noise heterogeneity. Moreover, we numerically 
observe that this allocation remains efficient in more general cases although it is not 
anymore optimal (it remains more efficient than the uniform one). 

4 Convergence of the learning curve for Gaussian 
process regression 

This section deals with the convergence of the BLUP when the number of observa- 
tions is large and the noise variance does not depend on x, i.e. t{x) = r and A = rJ. 
The speed of convergence of the BLUP is evaluated through the generalization error 
- i.e. the IMSE - defined in ([5]). The main theorem of this paper follows: 

Theorem 1 Let us consider Z{x) a Gaussian process with known mean and co- 
variance kernel k{x, x') G C^{W^ x W^) and (xj)j=i^...^„s an experimental design set 
of ns independent random points sampled with the probability measure dfi on M.'^. 
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According to Mercer's theorem {Mercer, 1909^ , we have the following representation 
of k{x, x'): 

k{x, x') = ^ \p(j)p{x)(j)p{x') (6) 

p>0 

where (</)p(a;))p is an orthonormal basis of consisting of eigenfunctions of 

{Tfj.,kf){x) = Jj^a k{x,x')f{x')dfi{x') and Xp is the nonnegative sequence of corre- 
sponding eigenvalues sorting in decreasing order. Then, for a non-degenerate kernel 
- i.e. when Ap > 0, Vp > - such that sup^.gigd k{x,x) < oo, we have the following 
convergence in probability for the MSE ^ of the BLUP: 



aHx) — > 



p>0 P 



For degenerate kernels, the convergence is almost sure. 

The sketch of the proof of Theorem [T] is given below. The full proof is given in 
Appendix \M 

Sketch of Proof. We first prove the theorem for degenerate kernels (see Appendix 
lAll) which was already known in that case. Next we find a lower bound for cr^(x) 
for non-degenerate kernels. Let us consider the Karhunen-Loeve decomposition of 
Z{x) = ^^px) ZpyO^(j)p{x) where {Zp)p is a sequence of independent Gaussian ran- 
dom variables with mean zero and variance 1. If we denote by aopt,i{x), i = 1, . . . ,ns, 
the coefficients of the BLUP associated to Z{x), the Gaussian process regression 
mean squared error can be written a^{x) = X]p>o i'^pi^) ~ Ylili O'opt,i{x)4>p{xi))^ ■ 
Then, for a fixed p, the following inequality holds: 

<^'^{x) > Ap I (t)p{x) - aopt,*(x)0p(xi) J = aljjpp{x) (8) 

p<p \ i=l / 

where, <y\^p,p{x) is the MSE of the LUP of coefficients aopt,i(a;) associated to the 



Gaussian process Zp{x) = J2p<p ^p^/K^^pi^) ■ Let us consider the MSE of the 

BLUP of Zp{x), we have the following inequality: 

Since Zp{x) has a degenerate kernel, Vp > 0, the almost sure convergence given in 
equation (13T1) holds for (see Appendix [All). Then, considering inequalities (E]) 
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and ([9]), the convergence (13T]) and the asymptotic p — t- oo, we obtain: 




Finally, we find an upper bound for cr^(x). Since cr^(x) is the MSE of the BLUP 
associated to Z{x), if we consider any other LUP associated to Z{x) its MSE denoted 
by crlijpix) satisfies the following inequality: 

a'ix) < alupix) (11) 

The idea is to find a LUP so that its MSE is a tight upper bound of a'^{x). Let us 
consider the LUP: 

hupix) = kixfAz^' (12) 

with A the ns x ns matrix defined hj A = L^^ + Yll=i{—^)^{L"^M)^L^^ with 

L = nrJ + X;p<p. Ap[0p(xi)0p(Xj)]i<i,j<„., M = Y.pyp* K[(t>p{^i)(Pp{^j)]i<i,j<ns, Q 
a finite integer and p* such that s\p* < r. The choice of this LUP is motivated 
by the fact that the matrix A is an approximation of the inverse of the matrix 
[nrl + K) that is tractable in the following calculations. Note that the BLUP is 
/blup(x) = k{xf{K + nriy^z'"'. Then, the MSE of the LUP is given by: 

2g+l 

alup{x) = k{x, x) - k{xfL-^k{x) - J2 {-iyk{xf{L-^MyL-^k{x) (13) 

Thanks to the Woodbury-Sherman-Morison formula^, the strong law of large num- 
bers and the continuity of the inverse operator in the space of p-dimensional invert- 
ible matrices, we have the following almost sure convergence: 

k{xfL-'k{x) "-^ Yl -^M^r + - E (14) 

^ — ' SA„ + T T ^ — ' 

p<p* ^ P>P* 

We note that we can use the Woodbury-Sherman-Morison formula and the strong 
law of large numbers since p* is finite and independent of n. Then, using the Markov 
inequality and the equality ^p>q\<Pp{xY = k{x,x) < oo, we have the following 
convergence in probability: 

k{xf{L-'MyL-'k{x) "-^ (-)'^' Yl K^^M^y (15) 

p>p* 

^li B is a non-singular p x p matrix, C a non-singular m x m matrix and A a m x p matrix 
with m,p <oo, then {B + AC-^A)-^ = B'^ - B-^A(A^B-^A + C)-^A^B-^. 
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We highlight that we cannot use the strong law of large numbers here due to the 
infinite sum in the definition of M. Finally, we obtain the following convergence in 
probability: 



limsupa^(x) < lim alup{x) = V f Ap 7^V-] ^vi^f-y^ ^^p^ — ^0p(a;) 

n-*-oo n-s^oo ' \ r + SAp J ■'— ^ T + SAp 

(16) 

By taking the limit g — )■ oo in the right hand side and using the inequality s\p* < r, 
we obtain the following upper bound for (T^(x): 



limsupa2(x)<5^(-^^) 



0p(x)2 (17) 



The result announced in Theorem [T] is deduced from the lower and upper bounds 
m and (E 



Remark 1 For non-degenerate kernels such that ||0p(x)||loo < oo uniformly in 
p, the convergence is almost sure. Some kernels such as the one of the Brownian 
motion satisfy this property. 

The following theorem gives the asymptotic value of the learning curve when n 
is large. 

Theorem 2 Let us consider Z{x) a Gaussian process with known mean and covari- 
ance kernel k{x,x') G C°(]R'^ x M.'^) and (xj)j=i^,,,^„s an experimental design set of ns 
independent random points sampled with the probability measure dfi on M"'. Then, 
for a non-degenerate kernel, we have the following convergence in probability: 



For degenerate kernels, the convergence is almost sure. 

Proof. From the Theorem [1] and the orthonormal property of the basis {(f)p{x))p 
in L'jj^ix), the proof of the theorem is straightforward by integration. We note that 
we can permute the integral and the limit thanks to the dominated convergence 
theorem since a'^{x) < k{x,x). ■ 
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Remark 2 The obtained limit is identical to the one established in Rasmussen and Williams, 2006 



and [Picheny, 2009 for a degenerate kernel. Furthermore, [Qpper and Vivarelli, 1999 



gives accurate upper and lower bounds for the asymptotic behavior of the IMSE for 
a degenerate kernel too. The originality of the presented result is the proof giving 
the asymptotic value of the learning curve for a non-degenerate kernel. We observe 
that the asymptotic value is the same as in the degenerate case but with a weaker 
convergence. We note that this result is of practical interest since the usual kernels 
for Gaussian process regression are non-degenerate and we will exhibit dramatic 
differences between the learning curves of degenerate and non-degenerate kernels. 

Proposition 1 Let us consider lim„^ooIMSE = IMSEqo- The following inequality 
holds: 

^B, < IMSEoo < (19) 
with B, = J2p s.t. x,<L Ap + 7# {p s.t. \p>-J. 

Proof. The proof is directly deduced from the Theorem [2] and the following inequal- 
ity: 

1 X 



with: 



T ^ T 

-X X < -k 

T 1 

1 x> ^ 



Remark 3 The inequality given in Proposition [T] provides a lower bound more 
precise in our case than the one given in [Micchelli and Wahba, 19^T| - X]p>s+i \ — 
IMSEqo- This lower bound corresponds to the case of an experimental design set 
made with the points of a Gaussian quadrature whereas in our case it is sampled 
randomly from a given measure. 



5 Examples of rates of convergence for the learning 
curve 

Proposition [T]shows that the rate of convergence of the generalization error when n is 
large is the same as the one of Bg. Furthermore, we can see from (fT9|) that it depends 
on the one of the eigenvalues (Ap)p>o. We give some examples of convergence in the 
following Subsection. 
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5.1 Rates of convergence for some usual kernels 

In this Subsection we deduce the rates of convergence of the learning curve for some 
usual kernels from the asymptotic decays of their eigenvalues. 

Example 2 (Degenerate kernels) For degenerate kernels we have # {p s.t. Aj, > 0} < 
oo. Thus, when s — > oo, we have: 

p s.t. Ap<^ 

from which: 

r 

Bsoc - 
s 

Therefore, the IMSE decreases as -. We find here a classical result about Monte- 

' s 

Carlo convergence which gives that the variance decay is inversely proportional to the 
number of observations whatever the dimension. Nevertheless, for non-degenerate 
kernels, the number of non-zero eigenvalues is infinite and we are hence in an infinite- 
dimensional case (contrary to the degenerate one). We see in the following examples 
that we do not conserve the usual Monte-Carlo convergence rate in this case which 
emphasizes the importance of Theorem [1] dealing with non- degenerate kernels. 

Example 3 (The fractional Brownian motion) Let us consider the fractional 
Brownian kernel with Hurst parameter H G (0, 1): 

A;(a:,2/) = x^^ + y^^^- (20) 

The associated Gaussian process - called fractional Brownian motion - is Holder 
continuous with exponent H — e,\/e > 0. According to [Bronski, 2003 , we have the 
following result: 

Proposition 2 The Karhunen-Loeve eigenvalues of the fractional Brownian motion 
with Hurst exponent HE (0, 1) satisfy the behavior 

l/ff ( (2g+2)(4g+3) .\ 



where 6 > is arbitrary, uh = '^'"^"^^^p^+'f "'"'''^ , and T is the Euler Gamma function. 
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Therefore, when s ^ 1, we have: 



\p< - if p> [—^) 



s 

We hence have the following approximation for Bg: 



5. 



Furthermore, we have: 



+00 



+ 1 



from which: 

CH,r . 1 , S > 1 

where Ch,t is a constant independent of s - i.e. the number of observations. 

The rate of convergence for a fractional Brownian motion with Hurst parameter 
H is ^ 1 . We note that the case H = \ corresponds to the classical Brownian 
motion and the given rate stands when fi is the uniform measure over [0, 1]. We 
observe that the larger the Hurst parameter is (i.e. the more regular the Gaussian 
process is), the faster the convergence is. Furthermore, for — ?■ 1 the convergence 
rate gets close to Therefore, even for the most regular fractional Brownian 
motion, we do not reach the classical Monte-Carlo convergence rate. 

Example 4 (The 1-D Matern covariance kernel) In this example we deal 
with the Matern kernel with regularity parameter z/ > in dimension 1: 

/cid(x,x>,/) = -— n— ^ K.y^^^ ^1 (21) 



where Ky is the modified Bessel function [Abramowitz and Stegun, 1965 . The 



Karhunen-Loeve eigenvalues of this kernel satisfy the following asymptotic |A.I and Nikitin, 2004 



Following the guideline of the Example 3 we deduce the following asymptotic be- 
havior for Bg. 

Bs ~ C—^, s > 1 

S 2,. 
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where C is a constant independent of s. 

This result is in agreement with the one of [Ritter, 2000a who proved that for 



1-dimensional kernels satisfying the Sacks- Ylvisaker of order r conditions (where r 
is an integer), the generalization error for the best linear estimator and experimental 
design set strategy decays as ^ ^ i . Indeed, for such kernels, the eigenvalues satisfy 

s 

the large p asymptotic Ap oc Ritter et al., 1995 and by following the guideline 



of the previous examples we find the same convergence rate. Furthermore, our 
result generalizes the one of [Ritter, 2000a| since it provides convergence rates for 
more general kernels (e.g. for r G (0,+oo)) and for any dimension (see below). 
Furthermore, it shows that the random sampling gives the same decay rate as the 
optimal experimental design. 

Example 5 (The d-D tensorised Matern covariance kernel) We focus here 
on the d-dimensional tensorised Matern kernel with isotropic regularity parameter 
> |. According to [Pusev, 2011 the eigenvalues of this kernel satisfy the asymp- 



V 

totics: 



Ap~0(p), P>1 
where the function is defined by: 

log(l + p)2('^-l)^ 



0(p) 

Its inverse d)"^ satisfies: 



^2v 



1(e) =£-277 (log (£-277 ) ) (1 + 0(1)), £<1 



We hence have the approximation: 



S.^——^log 1 + 0-1^ ' "--'^ 



We can deduce the following rate of convergence for Bg. 



S 2^ 



with C a constant independent of s. 
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Example 6 (The d-D Gaussian covariance kernel) According to |Todor, 2006 
tlie asymptotic beliavior of tlie eigenvalues for a Gaussian kernel is: 



Applying the procedure presented in the previous examples, it can be shown 
than the rate of convergence of the IMSE is bounded by: 



where Ci and C2 are constants independent of s. 

Remark 4 We see in the previous example that for smooth kernels, the conver- 
gence rate is close to s~^, i.e. the classical Monte-Carlo one. 

5.2 Illustrations 

In this Subsection we compare the previous theoretical results on the rate of con- 
vergence of the generalization error with full numerical simulations. 

In order to observe the asymptotic convergence, we fix n = 200 and the number 
of observations is ns = n, . . . , lOn. The experimental design sets are sampled from 
a uniform measure on [0, 1] and the observation noise is ?ir = 1. To estimate the 
IMSE ([5]) we use a numerical integration with 4000 quadrature points. 

First, we deal with the 1-D fractional Brownian kernel fl20|) with Hurst parameter 





s 



s 



H. We have proved that for large n, the IMSE decay as 



. Figure [T] compares 



the numerically estimated convergences to the theoretical ones. 
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500 1000 1500 2000 500 1000 1500 2000 



Number of observations Number of observations 

Figure 1: Rate of convergence of the IMSE when the number of observations in- 
creases for a fractional Brownian motion with Hurst parameter H = 0.5 (left) and 
H = 0.9 (right). The initial number of observations is n = 200 and the observation 
noise variance is nr = 1. The triangles represent the numerically estimated IMSE, 
the solid line represents the theoretical convergence, and the other non-solid lines 
represent various convergence rates. 



We see in figured] that the observed rate of convergence is perfectly fitted by the 
theoretical one. We note that we are far from the classical Monte-Carlo rate since 
we are not in a non-degenerate case. 

Finally, we deal with the 2-D tensorised Matern-| kernel and the 1-D Gaussian 
kernel. The 1-dimensional Matern-i^ class of covariance functions kiD{t,t'\v,9) is 
given by fl2T]) and the 2-D tensorised Matern-i^ covariance function is given by: 

k{x, x'\ V, 9) = hoixi, x[; z/, 9i)kiD{x2, x'^; 92) (22) 

Furthermore, the 1-D Gaussian kernel is defined by: 

k{x, X ; ^) = exp I — — 

Figure [2] compares the numerically observed convergence of the IMSE to the theo- 
retical one when 6*1 = ^2 = 0.2 for the Matern-| kernel and when 9 = 0.2 for the 
Gaussian kernel. We see in figure [2] that the theoretical rate of convergence is a 
sharp approximation of the observed one. 
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500 1000 1500 2000 500 1000 1500 2000 

Number of observations Number of observations 



Figure 2: Rate of convergence of the IMSE when the number of observations in- 
creases for a 2-D tensorised Matern-| kernel on the left hand side and for a 1- 
D Gaussian kernel on the right hand side. The initial number of observations is 
n = 200 and the observation noise variance is nr = 1. The triangles represent the 
numerically estimated IMSE, the solid line represents the theoretical convergence, 
and the other non-solid lines represent various convergences. 

6 Optimal resource allocation for heterogeneous ob- 
servation noise variance. 

In this section, we deal with the budget T defined as the sum of repetitions on all 
points of the experimental design set - i.e. T = Yl^li "^i with r, an integer repre- 
senting the number of repetitions conceded to the points Xj - when the observation 
noise variance is inversely proportional to the number of repetitions r^. 

In the previous Section, we have studied the convergence of the IMSE according 
to s (or equivalently T) which can be of practical interest to determine the needed 
budget T to achieve a prescribed precision. To determine this budget T we have 
made the assumption of a noise variance nr^x) independent of x and we have con- 
sidered the uniform allocation = r as in Example [1] Despite the fact that these 
assumptions are needed to determine the budget T, in order to provide the optimal 
resource allocation - i.e. the sequence of integers {ri, r2, . . . , r^^} minimizing the 
generalization error - it is worth taking into account the heterogeneity of the noise. 
For a Monte-Carlo based simulator, the number of repetitions r could represent the 
number of MC particles and the procedure presented below can be applied. 

Determining the optimal allocation of the budget T whatever the Gaussian pro- 
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cess for a heterogeneous noise is an open and non-trivial problem. To solve this 
problem, we first consider the continuum approximation in which we look for an 
optimal sequence of real numbers (rj)j=i^...^„s and then we round the optimal solu- 
tion to obtain a quasi-optimal integer-valued allocation (ri^int)j=i,...,ns- The following 
proposition gives the optimal resource allocation under certain restricted conditions 
for the continuous case. The reader is referred to [Munoz Zuniga et al., 20 iT for a 
proof of this proposition in a different framework (the proof uses the Karush-Kuhn- 
Tucker approach to solve the minimization problem with equality and inequality 
constraints). 



Proposition 3 Let us consider Z{x) a Gaussian process with a known mean and 
covariance kernel k{x,x') G C°(M'^ x M"'). Let (xi)j=i,,.,^„s be an experimental de- 



k(xj,Xj)+no-^{xj) 



IS non- 



sign set of ns points sorted such that the sequence . , 

^c{xj)nal(xj) I 

increasing, where na1{xi) is the noise variance of an observation at point Xi and 
c{x) = k{x' , x)"^ dfi{x') . When the covariance matrix K is diagonal, the real- 
valued allocation (rj)j=i^... minimizing the generalization error: 



IMSE 



k{x, x) - k{x f{K + nA)"^A;(x) d^i{x) 



under the constraints 'Y^^Li ^i = T and ri > 1, \/i = 1, . . . ,ns is given by: 
1 



opt 



^yc{xi)a^{xi) 



k{xi^xi) \ 

Ens V ^ . 



^^^^ T-z*+nE"-.. 



ns al {xj ) 



+1 k{xj,Xj) 



na;[Xi) 



where A = diag 



i=l,...,ns 



and: 



(23) 



I < i* 



(24) 



. =max<:^ = l suchthat ^jx., x^^ + na^jx.^ ^ T ^ + Ej^.^i g^gj 

2^j=i+l k{xj,Xj) 



^yc{xi)na^{xi) 



By convention, if: 

k{xi,Xi) + nal{xi) 
^J c{xi)nal{xi) 



< 



ns nal (xj ) 
i=j+l k{xj,Xj) 



Ens 
j='i 



y/c{xj)ncTl{xj) 



j=i+l k{xj,Xj) 



Vz = 1, 



, ns 



(25) 



(26) 



then i* = 0. 
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The optimization problem in Proposition[3]admits a solution if and only if T > ns 
which reflects the fact that ns simulations are already available. Furthermore, when 
T is large enough, we have z* = and the solution has the following form: 



^opt 



2^1 = 1 k(x,.xA ^ ^ 



(27) 



\2^j = l k{Xj,Xj) \ J / J 

While Proposition [2] gives a continuous optimal allocation, an admissible allo- 
cation must be an integer-valued sequence. Therefore, as mentioned previously, 
we solve the optimization problem with the continuous approximation and then we 
round the continuous solution to obtain a quasi-optimal integer- valued solution r°?*^. 
The rounding is performed by solving the following problem: 

Find J such that Yl^li ''"unt — ^ with: 



„opt ^ 
ijint r opt] 



[r°P*]+l z<J 
[rT] ^ > J 



where [x\ denotes the integer part of a real number x. 

We note that this allocation is not optimal in general (i.e. when K is not diago- 
nal). Nevertheless we have numerically observed that it remains efficient in general 
cases and is often better than the uniform allocation strategy. We note that the nu- 
merical comparison has been performed with different kernels (Gaussian, Matern-|, 
Matern-|, exponential, Brownian and triangular [Rasmussen and Williams, 2006| ) 
and in dimension one and two with a number of observations varying between 10 
and 400. Furthermore, two types of experimental design set have been tested, one is 
a random set sampling from the uniform distribution and the other one is a regular 
grid. 

Proposition [3] shows that it is worth allocating more resources at locations 
where the noise variance is more important. Furthermore, the quantity c(xj) = 
Jjgd A;(x, Xj)^ (i/i(x) - controlled by the measure ci/i and the kernel k{x,Xi) - can be 
viewed as the local density around the point Xj. The proposition emphasizes that 
it is better to allocate more resources where this local density is more important. 
Finally, we see that when the total budget T is large enough, the coefficient i* equals 
and the optimal resources allocation for is identical as the one obtained without 
the inequality constraints. 
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7 Industrial Case: code MORET 



We illustrate in this section an industrial application of our results about the rate 
of convergence of the IMSE. The case is about the safety assessment of a nuclear 
system containing fissile materials. The system is modeled by a neutron transport 



system of dry PUO2 storage. 

This section is divided into 3 parts. First, we present the Gaussian process regres- 
sion model built on an initial experimental design set. Then, thanks to Proposition 
[1] about the rate of convergence of the IMSE, we determine given a target precision 
the needed computational budget T . Finally, we allocate the resource T on the 
experimental design set. 

7.1 Data presentation 

The benchmark system safety is evaluated through the neutron multiplication factor 
/Ccff- This factor models the criticality of a chain nuclear reaction: 

• /ceflf > 1 leads to an uncontrolled chain reaction due to an increasing neutron 
population. 

• ^cflF = 1 leads to a self-sustained chain reaction with a stable neutron popula- 



• /ceflf < 1 leads to a faded chain reaction due to an decreasing neutron popula- 



The neutron multiplication factor depends on many parameters and it is evaluated 
using the stochastic simulator called MORET. We focus here on two parameters: 

• '^Pu02 ^ [0.5, 4]g.cm~^, the density of the fissile powder. It is scaled to [0, 1]. 

• c^watcr G [0, l]g.cm~^, the density of water between storage tubes. 

The other parameters are fixed to a nominal value given by an expert and we use 
the notation x = ((ipu02) c^water)- 

The MORET code provides outputs of the following form: 



code called MORET [Fernex et al., 2005 . In particular, we study a benchmark 



tion. 



tion. 




i=l 
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where {Yi{x))i=i^,,,^r are realizations of independent and identically distributed ran- 
dom variables which are themselves obtained by an empirical mean of a Monte-Carlo 
sample of 4000 particles. From these particles, we can also estimate the variance 
cr^(x) of the observation Yi{x) by a classical empirical estimator. The simulator 
gives noisy observations and the variance of an observation k^s^rix) equals a'^{x)/r. 

A large data base (^j(a:j))j=i^...,200j=i,...,5625 is available to us. We divide it into 
a training set and a test set. Let us denote by Yi{xj) the i*"^ observation at point 
Xj - the 5625 points Xj of the data base come from a 75 x 75 grid over [0, 1]^. The 
training set consists of ?i = 100 points (a;*'''^^'^)j=i....,„, extracted from the complete 
data base using a maximin LHS and of the first observations (yi(x*'^^™))j=i.,,,^ioo- 
We will use the other 5525 points testing set. 

The aim of the study is - given the training set - to predict the budget needed 
to achieve a prescribed precision for the surrogate model and to allocate optimally 
these resources. More precisely, let us denote by rj the resource allocated to the 
point xj^^^^ of the experimental design set. First, we want to determine the budget 
T = Yl^=i which allows us to achieve the target precision. Second, we want to 
determine the best resource allocation {rj)j=i^,,,^n- 

To evaluate the needed computational budget T the noise variance cr^(x) is con- 
sidered as a constant in order to fit with the hypotheses of the theorem. The constant 
variance equals the mean J^^p a'^{x) dfi{x) of the noise variance which is here esti- 
mated by = ^i=i '^e ('^F'^™) ~ 3.3. 10^'^. Furthermore, we look for a uniform 
budget allocation, i.e. = r Vj = 1, . . . , n. In this case, the total computational 
budget is T = nr. 

7.2 Model selection 

To build the model, we consider the training set plotted in figure HI It is composed 
of the n = 100 points (a;*™")j=i^...^„ which are uniformly spread on Q = [0, 1]^. 

Let us suppose that the response is the realization of a Gaussian process with 
a tensorised Matern-i/ covariance function. The 2-D tensorised Matern-z/ covari- 
ance function k{x, x'; u, 9) is given in (l22l) . The hyper-parameters are estimated by 
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maximizing the concentrated Maximum Likelihood [Stein, 1999 : 

~{z - mfia^K + all)-\z - m) - ]^det{a^K + all) 

where K = [k{xf'^^^, xj^^'^; u, 6')]jj=i. / is the identity matrix, cr^ the vari- 
ance parameter, m the mean of kcs^r{x) and z = . . . , Yi(xJ^''^™)) the 
observations at points in the training set. The mean of kcs,r{x) is estimated by 
'^=TMEj=in(xf-) = 0.65. 

Due to the fact that the convergence rate is strongly dependent of the regular- 
ity parameter u, we have to perform a good estimation of this hyper-parameter to 
evaluate the error model decay accurately. Note that we cannot have a closed form 
expression for the estimator of cx^, it hence has to be estimated jointly with 6 and 
u. 

Let us consider the vector of parameters (p = (z/, 6'i, 6*2, o"^). In order to per- 
form the maximization, we have first randomly generated a set of 10,000 parameters 
(0fc)fc=i,...,io4 on the domain [0.5,3] x [0.01,2] x [0.01,2] x [0.01, 1]. We have then se- 
lected the 150 best parameters (i.e. the ones maximizing the concentrated Maximum 
Likelihood) and we have started a quasi-Newton based maximization from these pa- 
rameters. More specifically, we have used the BFGS method [Shanno, 1970| . Finally, 
from the results of the 150 maximization procedures, we have selected the best pa- 
rameter. We note that the quasi-Newton based maximizations have all converged 
to two parameter values, around 30% to the actual maximum and 70% to another 
local maximum. 

The estimation of the hyper-parameters are i/ = 1.31, 6i = 0.67, 62 = 0.45 
and cr^ = 0.24. This means that we have a rough surrogate model which is not 
differentiable and a-Holder continuous with exponent a = 0.81. The variance of 
the observations is al = 3.3.10"^, using the same notations as Theorem [H we have 
al = nr therefore r = 3.3.10"^ and n = 100. 

The IMSE of the Gaussian process regression is IMSEtj, = 1.0.10"'^ and its 
empirical mean squared error is EMSEjjj = 1.2.10"'^. To compute the empirical 
mean squared error (EMSE), we use the observations (1^ (xj ) )i=i^. .. ^200, j=i... ,5525 with 
Xj 7^ x^rain ^ 1, . . . , 100, j = 1, . . . , 5525 and to compute the IMSE we use a 
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trapezoidal numerical integration into a 75 x 75 grid over [0, 1]^. For r = 200, 

2 

the variance of the output kes^r{x) equals ^ = 1.64.10"^ and is neglected for 
the estimation of the empirical error. We can see that the IMSE is close to the 
empirical mean squared error which means that our model describes the observations 
accurately. 



7.3 Convergence of the IMSE 

According to Proposition [H we have the following convergence rate for the IMSE: 

IMSE ~ i^^^ (28) 

where the model parameter u plays a crucial role. We can therefore expect that the 
IMSE decays as: 

i°g(r) 

IMSEr = IMSE^„|g (29) 

-'o 

Let us assume that we want to reach an IMSE of 2.10"^. According to the 
IMSE decay and the fact that the IMSE for r = 1 has been estimated to be equal to 
1.0.10"^, the total budget required is T = nr = 3600, i.e. r = 36. Figure [3] illustrates 
the empirical mean squared error convergence and the predicted convergence fl29|) 
of the IMSE. 
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Figure 3: Comparison between Empirical mean squared error (EMSE) decay and 
theoretical IMSE decay for n = 100 when the total budget T increases. The tri- 
angles represent the Empirical MSE, the solid line represents the theoretical decay, 
the horizontal dashed line represents the desired accuracy and the dashed line the 
classical M-C convergence. We see that Monte-Carlo decay does not match the 
empirical MSE and it is too fast. 

We see empirically that the EMSE of 2.10^'^ is achieved for r = 31. This shows 
that the predicted IMSE and the empirical MSE are very close and that the selected 
kernel captures the regularity of the response accurately. 

Let us consider the classical Monte-Carlo convergence rate j;, which corresponds 
to the convergence rate of degenerate kernels, i.e. in the finite -dimensional case. 
Figure [3] compares the theoretical rate of convergence of the IMSE with the clas- 
sical Monte-Carlo one. We see that the Monte-Carlo decay is too fast and does 
not represent correctly the empirical MSE decay. If we had considered the rate of 
convergence IMSE ~ we would have reached an IMSE of 2.10"^ for r = 6 (which 
is very far from the observed value r = 31). 

7.4 Resources allocation 

We have determined in the previous section the computational budget required to 
reach an IMSE of 2.10"^^. We observe that the predicted allocation is accurate 
since it gives an empirical MSE close to 2.10"^. To calculate the observed MSE, 
we uniformly allocate the computational budget on the points of the training set. 
We know that this allocation is optimal when the variance of the observation noise 
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is constant. Nevertheless, we are not in this case and to build the final model we 
allocate the budget taking into account the noise level. We note that the observation 
noise is denoted by cr^(x), the total budget is T = Yl^=i''^i where n = 100 is the 
number observations and the budget allocated to the point xf^^^. 

From (|27|) , when the input parameter distribution /i is uniform and for a diagonal 
covariance matrix, the optimal allocation is given by: 



(30) 



We have numerically observed that this allocation remains efficient in more gen- 
eral cases although it is not anymore optimal. Here we use this allocation to build 
the model. Let us consider that we do not have observed the empirical MSE decay, 
we hence consider the budget given by the theoretical decay T = 3600. The allo- 
cation given by equation (130|) is illustrated in figure H] with the contour of the noise 
level. 
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Figure 4: On the left hand side: initial experimental design set with n = 100. On the 
right hand side: noise level dependence of the resources allocation. The solid lines 
represent the noise variance contour plot and the numbers represent the resources 
allocated to the points of the experimental design set. 



We see in figurelHthat the resources allocation is more important at points where 
the noise variance is higher. Table [1] compares the performance of the two allocations 
on the test set. 
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Uniform Allocation 


Optimal Allocation 


MSE 


1.94.10"^ 


1.86.10"^ 


MaxSE 


3.66.10-2 


3.38.10-2 



Table 1: Comparison between uniform and optimal (under the condition K diago- 
nal) allocation of resources. 

We see in Table [T] that the budget allocation given by the equation (150]) gives 
predictions slightly more accurate than the uniform one. 
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9 Conclusion 

The main result of this paper is a theorem giving the Gaussian process regression 
mean squared error as a function of the training size when the number of observations 
is large. This asymptotic value of the mean squared error is derived in terms of 
the eigenvalues and eigenfunctions of the Karhunen-Loeve decomposition of the 
covariance function and holds for degenerate and non-degenerate kernels and for 
any dimension. 

From this theorem, we can deduce the asymptotic behavior of the generalization 
error - defined in this paper as the Integrated Mean Squared Error - as a function 
of the size of the training set. This result generalizes previous ones which give this 
behavior in dimension one or two or for a restricted class of covariance kernels (for 
degenerate ones). The significant differences between the rate of convergence of 
degenerate and non-degenerate kernels highlight the relevance of our theorem which 
holds for non-degenerate kernels. This is especially important as usual kernels for 
Gaussian process regression are non-degenerate. 

Our work deals with Gaussian process regression when the variance of the noise 
can be reduced by increasing the budget. Our results are of practical interest in 
this case since it gives the total budget needed to reach a precision prescribed by 
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the user. Nonetheless, it holds under the assumptions of homoscedastic observation 
noise. Despite the fact that this assumption is relevant to evaluate the budget, 
it is not optimal to determine the resources allocation. Indeed, in this case it is 
worth taking into account the noise variance heterogeneity and using a non-uniform 
allocation. We describe the resulting error reduction under restricted conditions. 
We have observed on test cases that our non-uniform allocation is better than the 
uniform one in more general cases although it is not optimal anymore. 



A Proof of the main theorem 

A.l Proof of Theorem [It the degenerate case 

The proof in the degenerate case follows the lines of the ones given by [Opper and Vivarelli, 1999 



Rasmussen and Williams, 2006 and [Picheny, 2009 . For a degenerate kernel, the 



number p of non-zero eigenvalues is finite. Let us denote A = diag(Aj)i<j<p, = 

((/)i(a;), . . . , 4>p{x)) and $ = ^ (pi^XiY ■ ■ ■ (p^XnsY ) • "^^^ mean squared error of 
the Gaussian process regression is given by: 

= 0(x)A(/)(x)^ + (/)(x)A<l>^ ($A$^ + nr/)"^ $A(/)(a;)^ 

Thanks to the Woodbury-Sherman-Morison formula^ and according to [Opper and Vivarelli, 199"9 
and [Picheny, 2009| the Gaussian process regression error can be written: 



a^fx) = 0(x) ( ^^ + A-M 0(xf 
' nr 



Since p is finite, by the strong law of large numbers, the entries of the p ^ p matrix 
converge. We so have the following almost sure convergence: 



-A 



(31) 

P<P ^ ^ -^^P 



A. 2 Proof of Theorem [Tt the lower bound for a'^{x) 

The objective is to find a lower bound for a'^{x) for non-degenerate kernels. Let 
us consider the Karhunen-Loeve decomposition of Z{x) = ^p>o 'Zip\f\,(\)y{x) where 

B is a non-singular p x p matrix, C a non-singular m x m matrix and A a m x p matrix 
with m,p <oo, then {B + AC-^A)-^ = B'^ - B-^A{A^B-'^A + C)-^A^B-^. 
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{Zp)p is a sequence of independent Gaussian random variables with mean zero and 
variance 1. If we denote by aj(x) the coefficients of the BLUP associated to Z{x), 
the mean squared error can be written 



E 



E 



Z{x) - y^^ai{x)Z{ 



i=l 



vp>0 



1=1 





2" 







^ Xp [ (ppix) - ai{x)(j)p{xi) 

p>0 



i=l 



Then, for a fixed p, the following inequality holds: 



p<p \ i=l 



X) 



(32) 



<j\uPp{x) is the MSE of the LUP of coefficients ai{x) associated to the Gaussian 
process Zp{x) = Ylip<p ^p\/\>4'p{^) ■ Let us consider a^^x) the MSE of the BLUP of 
Zp(x), we have the following inequality: 



(^LUP,pi.x) > ^l(a^) 



(33) 



Since Zp{x) has a degenerate kernel, the almost sure convergence given in equation 
(13T]) holds for cr|(x). Then, considering inequalities (132|) and (133|) and the conver- 
gence (13T]) . we obtain: 



rXr 



P<P ^^^'^P 



(ppixf 



(34) 



lim inf o"^ (x) > / ( - 

n—^oo ^ — ^ \ ' 
p<p ^ 

Taking the limit p — > oo in the right hand side gives the desired result. ■ 

A. 3 Proof of Theorem the upper bound for a'^{x) 

The objective is to find an upper bound for a'^{x). Since a^{x) is the MSE of the 
BLUP associated to Z{x), if we consider any other LUP associated to Z[x) its MSE 
denoted by o"|^p satisfies the following inequality: 



(35) 
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The idea is to find a LUP so that its MSE is a tight upper bound of (T^(x). Let us 
consider the LUP: 

hupix) = kixfAz'^' (36) 

with A the ns x ns matrix defined hy A = L^^ + J2l^^{—1)''{L^^M)''L"^ with 

L = riTl + Xlp<p* -^p[0p(a;i)0p(a;j)]i<ij<„s, M = J2p>p* Ky^pi^i)<f'pi^j)h<id<ns, Q a 
finite integer and p* such that sXp* < r. The matrix A is an approximation of the 
inverse of the matrix L + M = nrl + K. Then, the MSE of the LUP (!36l) is given 
by: 

alup{x) = k{x, x) - k{xf {2A - A{nTl + K)A) k{x) 
and by substituting the expression of A into the previous equation we obtain: 



2g+l 



a 



LUP 



» = k{x,x) - k{xfL-^k{x) - J^i-'^ynxfiL'^MyL-^kix) 



(37) 



First, let us consider the term k{x)^L ^k{x). Since p* < oo, the matrix L can 
be written: 



L = nTl + $p.A$p, 



(3^ 



where A = diag(Ai)i<i<p., ^p* = 0(xi)^ . . . (pixnsV ) and (f){x) = (0i(x), . . . , 
Thanks to the Woodbury-Sherman-Morison formula, the matrix is given by: 



/ / $j>p^ ^ 



nr TIT 



riT 



p* 
nr 



(39) 



From the continuity of the inverse operator for invertible p* x p* matrices and by 
applying the strong law of large numbers, we obtain the following almost sure con- 
vergence : 



k{xyL-^k{x 



^ ns 1 ^ 

— V/c(x,Xi)2 - — V 
nr ^-^ ^-^ 

j=i p,'?=i 



P.9 



X 



- ^ A;(x, Xi)0p(xi) -^A;(x,Xj)0g(xj 



p,(j=i 



E 



E^[A;(x, X)0p(X)]E^[A;(x, X)0,(X)] 



where is the expectation with respect to the distance /i. We note that we can 
use the Woodbury-Sherman-Morison formula and the strong law of large numbers 
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since p* is finite and independent of n. Tlien, tlie ortlionormal property of the basis 
(0p(x))p>o implies: 

p>0 

Therefore, we have the following almost sure convergence: 

k{xfL-^k{x) T^M"^^" + 7 E KM^)" (40) 

p<p* P p>p* 

Second, let us consider the term ^^^^^(— We have 
the following equality: 



where: 



L' = 



(41) 



with c/^"^, 



^^^^ + A-i 



. Since g < oo, we can obtain the convergence 



J p,p' 



in probability of X]j=t i—^yk{x)'^{L ^MyL ^k{x) from the ones of: 



k{x) 



n \ n J 



k{x) 



(42) 



and: 



kixY - 



\n I \ n 



(43) 



with i <2q+l and j < i. Let us consider k{x)^^ ("iJ^) ^(^) ^ ^ 

have: 



1 /MV' /L'M 



n \ n J \ n 



/i;(x) 



y , . . . ti^"^ , y si^i (44) 



Pl,...,pi-j<p* 

p[,-,p',-,<p* 



qi,...,qi-j>p* 
m\,...,mj>p* 
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with: 



qin) 



mi 



r=l 
ns 



n 



r=l 



X 



r=l 



E 

r=l 



'mi 



r=l 



A, 



n 

1=1 r=l 



We consider now the term: 

An) 



ns ^ /tJ 

= —'YM^k)M^k)-'Y(pp'{xk)(f)q{xk) (45) 



ns 



n 



n 



k=l k=l 

with < p*. From Cauchy Schwarz inequahty and thanks to the following in- 
equality: 

— ^ Ap/|0p.(x)|2 = ^p^Hx,x) 



(x)\^< 



we obtain: 



P p'>0 



n ^ 



"^qKXi] 



i=l 



with cr^ = sup^. x). Considering the expectation with respect to the distribution 
of points Xj, we obtain Vp < oo: 



.<?>p 



a 



(n) 



5>P 



From Markov inequality, V(5 > 0, we have: 



q>p 



> 5 < 



^q>p q,p,p' 



< 



p* /-^q>p 



(46) 



Furthermore, V(5 > 0, Vp > p*: 



P 



E 

9>P* 



g,p,p' 



> 25 < P 



p*<q<p 



>6 \ +F 



E' 

g>p 



9>P>P' 



> 5 



We have for all q G (p*,p] : %ppi Ciq,p,p' = \s^5q=p5q=pi = 0, as n — > oo, therefore: 



limsupP 

n— >oo V 



E' 

q>p* 



in) 

q,p,p' 



>25\ < 
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Taking the limit p — > oo in the right hand side, we obtain the convergence in 
probabihty of X]g>p* '^g?p,p' when n oo: 

^ ns ^ ns 

n ^-^ n ^-^ 

q>p* r=l r=l 

Following the same method, we obtain the convergence: 



V 



Y.[^Y. ^(^' ^r)(i>q{Xr) (i>p{Xr)(l>q{Xr) 1^0 Vp < p* (48) 

g>p* \ r=l r=l / 

Let US return to S^m- By using Cauchy Schwarz inequality and bounding by the 
constant M all the terms independent of qi and m^, we obtain: 



gi,...,<ji_j>p* 



< 



3 ^ ns 



r=l 



r=l 



r=l 



, ns 

A 



qi,...,qi-j-l>p* 1=1 r=l 



r=l 



Since X;p>o KM^f = H^^ ^) < have the inequality < Emi,...,m, IlLi -^"^i^ Er^i (l^miixrf < 

(scr^)-'. Thus, for i > j and from (I47p and f HS]) we obtain the following convergence 
in probability when n — )■ oo: 



/ ^ '-^q.m 
qi,...,qi-j>p* 
mi,...,mj>p* 



V 



Therefore, from (144|) we obtain the following convergence when n — >■ oo: 



k{x 



n \ n J \ n'^ 

Following the same guideline as previously, it can be shown that when — oo 

1 fMY fL'MY"^ V 



(49) 



k{x 



n \ n 



k{x) -^0 \/i<] 



(50) 
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From the convergences (H9|) and ( !50|) . we deduce the following one when n — )■ oo: 



n \ n 



Therefore, to complete the proof we have to show that: 



h{xf (—\k{x) 
n \ n- J 



Let us consider for a fixed j > 1: 



p>p* 



-kixf k{x) = J2 

^ ^ m\,...,mj>p* 



n 



with m = (mi, . . . , rrij) and: 

\ r=l / \ r=l 



n 

1=1 r=l 



i=l 



< 



n 

i=l r=l 



(51) 



From Cauchy-Schwarz inequality, we have: 

-Y^k{x,Xrf\\[-Y^\^^ct^mX^rf (52) 
r=l / i=l r=l 

j ^ ns 

scr^]^-EAm^0m,(a;r-)^ (53) 



Therefore, considering the expectation with respect to the distribution of the 
points {xr)r=i,...,ns) wc havc: 

\i=i / ti,...,tj=i 
The following inequality holds uniformly in ti, . . . , tj = 1, . . . , ns: 

3 

\{<i>mAXu) 



i=l 
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where bm = Y.V£U{{l,...,j})Ylr=l'^^^ [Ylieu ^rnX^) ] because the term of left hand 

side of the inequahty is equal to one of the terms in the sum of the right hand side. 
n({l, . . . is the collection of all partitions of {1, . . . , j} and fl J^' = 0, Vr 7^ r'. 
We hence have: 



i=l 



Since J2p>o ^p'Ppi^Y < cr^; we have: 



mi,...,mj>p* 1=1 



mi,...,mj>p* 1=1 -p(=n({l,...,j}) r=l 

I 



ieir 



ren{{i,...,j}) r=i 

P = U\.^^Ir 

< a2^#{n({i,...,j})} 



mi>p* 



Since the cardinality of the collection n({l,...,j}) of partitions of is 
finite, the series V ^ , Ff-' -, Xm bm converges. Furthermore, as it is a series 

with positive terms, Ve > 0, 3p > j»* such that : 

3 

where designs the complement of Mp defined by the collection of m = (mi, . . . ,mj) 
such that: 

M = {m = (mi, . . . ,mj) such that mj > p* , i = 1, . . . ,j} 



Mp = {m = (mi, . . . , rrij) such that p* < nii < p, i = 1, . . . , j} 

= M\ Mp 

Therefore, we have \fS > 0, \/s > 3p > such that uniformly in n: 
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Applying the Markov inequality, we obtain: 



6 



P E Wnl\x)\>- \ <e 



(54) 



Furthermore, by denoting am{x) = lim„_>.oo am^(x), we have: 

j 

(^m{-^^ 'S'' -^mi -^mj 0mi ('^)0mj ('^) '^milmi = ---=mj 

i=l 

and from Cauchy-Schwarz inequality (see equation ( l53i) ). we have: 

j 

|a^(x)| <s^+V^J]A^, 

i=l 

We hence can deduce the inequality: 

5^ |a^(x)|<s^-+V^ flX^^ 



(55) 



(56) 



meM. 



Thus, 3p such that XlmeM? I'2m(a;)| < | for all x G M'^. From the inequalities ([5 
and ( !56|) . it can be shown that 3p such that: 



P 



mGM 

Since: 



meM 



> 25 < £ + P 



meMp 



E °'»(^) 



> 6 



lim sup P 

n— >-oo 



> 5 =0 



>26 \ <e We>0 



E ^^^(^) " E 

meMp meMp 

then: 

lim sup P ^ "mH^^) - E '^'"(^) 

\ meM meM 

The previous inequality holds Ve > 0, thus we have the convergence in probability 
of XlmGAf (^m\x) to XlmgA/ '^m(a^) with (by usiug the limit in the equation (l55l) ): 

meM p>p* 

Finally, we have the following convergence in probability when — oo: 

■ s \ i+i 



k{xy {L~'MyL-^k{x) 



1 ; / \ "^oo 



r 



(57) 
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We highlight that we cannot use the strong law of large numbers here due to the 
infinite sum in M. 

From the equation (l37l) and the convergences (l40l) and ( ISTl) . we obtain the fol- 
lowing convergence in probability: 



By considering the asymptotic g — oo and the inequality sXp* < r, we obtain the 
following upper bound for o"^(x): 



limsup(T^(a;) < ( ^ ] ^pi^Y 



(59) 
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